
(0.3.1) Unify JRA55 onto the generic dataset backend, extend region support across all paths #169

Open
simone-silvestri wants to merge 58 commits into main from
ss/dataset-backend-prefetch


@simone-silvestri
Member

@simone-silvestri simone-silvestri commented Apr 17, 2026

Summary

Two threads, woven together:

  1. JRA55 now uses the same code path as every other dataset. The bespoke JRA55NetCDFBackend and JRA55FieldTimeSeries are gone. JRA55 atmospheric variables flow through Field(::Metadatum) and FieldTimeSeries(::Metadata) like ECCO, EN4, WOA, GLORYS, ERA5, ORCA.

  2. BoundingBox and Column regions now work everywhere. These types existed on main, but Column extraction was a special-cased branch in Field(::Metadatum), regions weren't supported on FieldTimeSeries at all, and JRA55 ignored them entirely. After this PR, regions work uniformly across Field, FieldTimeSeries, every dataset including JRA55, on CPU and GPU.

Closes #18.

What changes for users

Old JRA55 entry points are gone:

# Before
ua = JRA55FieldTimeSeries(:eastward_velocity, arch; backend=JRA55NetCDFBackend(40))
atmosphere = JRA55PrescribedAtmosphere(arch; backend=JRA55NetCDFBackend(40))

New, same shape as every other dataset:

ua = FieldTimeSeries(Metadata(:eastward_velocity; dataset=RepeatYearJRA55()), arch;
                     time_indices_in_memory=40)
atmosphere = JRA55PrescribedAtmosphere(arch; time_indices_in_memory=40)

Regions on any dataset, including JRA55:

# Clip JRA55 to a region
bbox = BoundingBox(longitude=(-30, 30), latitude=(-60, 60))
ua_clipped = FieldTimeSeries(Metadata(:eastward_velocity; dataset=RepeatYearJRA55(),
                                       region=bbox))

# Extract a single Column from ECCO with bilinear interpolation
col = Column(12.0, -50.0; interpolation=Linear())
md  = Metadatum(:temperature; dataset=ECCO4Monthly(), date=..., region=col)
field = Field(md)   # 1×1×Nz Flat/Flat/Bounded

The InMemory() backend for JRA55 is dropped; if you want the whole series in memory, pass time_indices_in_memory = length(metadata).

What was already on main, what we added

Already there:

  • BoundingBox and Column region types (src/DataWrangling/metadata.jl).
  • construct_native_grid dispatches for both region types.
  • Field(::Metadatum) accepting Column metadata via a separate column_field_from_file branch.
  • ECCO velocity-axis "mangling" (ShiftSouth / AverageNorthSouth) for staggered files.

What this PR adds:

  • A single GPU-friendly kernel _set_region_kernel! (src/DataWrangling/set_region_data.jl) that all read paths funnel through. It handles, in one pass:
    • BoundingBox offset (straight indexed read into the clipped target),
    • Column cyclic-aware bilinear/nearest blending,
    • NaN-aware corner handling so _FillValue/land cells don't poison the average,
    • unit conversion,
    • the staggered-velocity mangling (previously a separate path).
  • Field(::Metadatum) rewritten on top of this kernel — the special column_field_from_file branch is gone; both regions go through set_region_data!.
  • FieldTimeSeries(::Metadata) and the JRA55 set! paths also call set_region_data!, so regions work for time series and for JRA55 (neither did before).
  • Cyclic-aware coordinate bracketing (bracket_with_weight with optional period) so a Column at 359.5° wraps correctly across the periodic seam.
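
To make the seam behaviour concrete, here is a minimal, self-contained sketch of cyclic-aware bracketing. It is not the package's `bracket_with_weight`: the name `bracket_with_weight_sketch`, its return convention `(ilo, ihi, w)`, and the clamping at non-periodic ends are all assumptions made for illustration.

```julia
# Hypothetical sketch, not the implementation in this PR.
# Returns (ilo, ihi, w) so that the interpolated value is
# (1 - w) * f[ilo] + w * f[ihi].
function bracket_with_weight_sketch(x, nodes::AbstractVector; period=nothing)
    N = length(nodes)
    if period === nothing
        # Non-periodic axis: clamp to the first or last node (an assumption).
        x <= nodes[1] && return 1, 1, 0.0
        x >= nodes[N] && return N, N, 0.0
    else
        # Periodic axis: map x into [nodes[1], nodes[1] + period).
        x = nodes[1] + mod(x - nodes[1], period)
        if x >= nodes[N]
            # x sits in the seam between the last node and the wrapped first node.
            gap = nodes[1] + period - nodes[N]
            return N, 1, (x - nodes[N]) / gap
        end
    end
    ihi = searchsortedfirst(nodes, x)          # first node >= x
    nodes[ihi] == x && return ihi, ihi, 0.0    # exactly on a node
    ilo = ihi - 1
    return ilo, ihi, (x - nodes[ilo]) / (nodes[ihi] - nodes[ilo])
end

# Assumed 1° cell centres from 0.5° to 359.5°.
centres = 0.5:1.0:359.5
bracket_with_weight_sketch(359.75, centres; period=360)  # brackets nodes 360 and 1
bracket_with_weight_sketch(1.0, centres)                 # brackets nodes 1 and 2
```

With `period` set, a longitude past the last cell centre is bracketed by the last and first nodes rather than being clamped, which is the wrap the bullet above describes.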

Smaller cleanups along the way

  • Dead code removed from the bounding-box helpers (a half-dozen unused dispatches that simplified down to one each after the refactor).
  • Double native-grid rebuild in FieldTimeSeries(::Metadata, grid) collapsed into a single architecture-aware comparison.
  • JRA55 multi-year calendar handling: verified by inspecting a real file that the time axis uses the standard Gregorian calendar (leap days included), so the previous DateTimeNoLeap workaround was unnecessary; the bespoke component-matching helper is gone, replaced with a plain findfirst. Function name and surrounding comments updated to match reality.
  • New tests: test_column_field.jl, test_dataset_region.jl, test_mangling.jl, test_jra55_region.jl, plus additions to test_metadata.jl.

Test plan

  • test/test_jra55.jl, test/test_jra55_region.jl — JRA55 on the new path, with and without regions
  • test/test_column_field.jl — Column extraction (Linear/Nearest, cyclic wrap, NaN-aware blend)
  • test/test_dataset_region.jl — BoundingBox across datasets
  • test/test_mangling.jl — ECCO v_velocity mangling end-to-end
  • test/test_ocean_only_model.jl, test/test_ocean_sea_ice_model.jl — migrated to the new atmosphere signature
  • test/test_checkpointer.jl — JRA55 atmosphere survives JLD2 round-trip on the new path
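
For readers unfamiliar with the NaN-aware blend that the Column tests exercise, the following sketch shows one way such a blend can work. It is an illustration only: `nan_aware_blend` is a hypothetical name, and renormalising by the surviving weights is an assumption about the behaviour, not a copy of `_set_region_kernel!`.

```julia
# Hypothetical sketch: bilinear blend of four corner values that skips NaN
# corners and renormalises by the weight that remains.
function nan_aware_blend(f00, f10, f01, f11, wx, wy)
    values  = (f00, f10, f01, f11)
    weights = ((1 - wx) * (1 - wy), wx * (1 - wy), (1 - wx) * wy, wx * wy)
    total = 0.0
    wsum  = 0.0
    for (v, w) in zip(values, weights)
        if !isnan(v)          # land / _FillValue corners contribute nothing
            total += w * v
            wsum  += w
        end
    end
    # If every corner is NaN (or carries zero weight) there is nothing to blend.
    return wsum > 0 ? total / wsum : NaN
end

nan_aware_blend(1.0, 2.0, 3.0, 4.0, 0.5, 0.5)   # plain bilinear average
nan_aware_blend(1.0, 3.0, NaN, NaN, 0.5, 0.5)   # two land corners are ignored
```

A naive weighted sum would return `NaN` for the second call; skipping the masked corners keeps a usable value next to coastlines.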

This commit reworks `src/DataWrangling/` along two related axes:

1. JRA55 is no longer a special-case data source. The bespoke
   `JRA55NetCDFBackend` struct, `JRA55FieldTimeSeries` constructor, and
   `InMemory()` backend support are removed. JRA55 now flows through the
   same `FieldTimeSeries(::Metadata)` / `Field(::Metadatum)` API as
   ECCO4 / EN4 / WOA / GLORYS, with `inpainting = nothing` selecting the
   chunked-yearly NetCDF dispatch path. `JRA55NetCDFBackend(N)` is kept
   as a thin function that returns a `DatasetBackend(N, nothing;
   inpainting=nothing)` so existing call-sites still construct the right
   backend.

2. A new `PrefetchingBackend{B<:DatasetBackend}` wraps any
   `DatasetBackend` and hides the next sliding-window's I/O behind the
   current window's compute via `Threads.@spawn`. Opt in via
   `prefetch=true` on `FieldTimeSeries(::Metadata)`,
   `DatasetRestoring(...)`, or `JRA55PrescribedAtmosphere(...)`.

New / removed
-------------

- New `src/DataWrangling/dataset_backend.jl`: the existing
  `DatasetBackend{N,C,I,M}` extracted from
  `metadata_field_time_series.jl`, plus its constructors / accessors
  and the generic per-file `set!` used by ECCO4 / EN4 / WOA.
- New `src/DataWrangling/prefetching_backend.jl`: the
  `PrefetchingBackend` wrapper, hot/cold-path `set!`, cyclical-wrap
  scheduling, property forwarding so `fts.backend.start` etc. continue
  to address the inner `DatasetBackend`.
- New `retrieve_data(::JRA55Metadatum)` (split per dataset type to
  handle the no-leap calendar issue in multi-year files; see below) so
  that the generic `Field(::Metadatum)` path produces a correct 2D
  slice for any JRA55 metadatum.
- Removed `JRA55FieldTimeSeries`, the `JRA55NetCDFBackend` struct, and
  the JRA55-specific `Adapt.adapt_structure`.
- Updated tests and examples to the new pattern.

Net diff: +487 / −403.

Closes #18.

----

Background — why this matters
-----------------------------

OMIP simulations on the ORCA grid surfaced periodic wall-time spikes
during time stepping:

```
[ Info: iteration: 133680, wall time: 1.780 seconds
[ Info: iteration: 133690, wall time: 23.906 seconds   <-- spike
[ Info: iteration: 133700, wall time: 1.732 seconds
```

Two causes were identified on `ss/omip-prototype`:

1. `set!(fts::JRA55NetCDFFTSMultipleYears)` was opening every yearly
   NetCDF file in the metadata (~60 for the full 1958–2019 atmosphere)
   per reload, even files with no overlap with the current window.
   This added up to ~660 NetCDF opens per reload across 11 atmospheric
   variables.

2. The remaining ~15 s spike (down from ~24 s after fix 1) is the
   actual cost of reading ~2 GB of compressed NetCDF across 11 files,
   and cannot be reduced further without either shrinking the window
   (more frequent spikes) or hiding the I/O behind compute.

On the 1° configuration, this manifests roughly as **~35 s per reload
on the cold path** and **~15 s on the hot path** (after staging files
to fast scratch and applying the per-window file filtering from
`ss/omip-prototype`). Across a year of simulation that is a few percent
of total wall time, dominated by I/O serialised against the time step.

The first fix (per-window filename filtering + per-file `ftsn_loc`) is
ported here as a plain bug fix. The second is what motivates the
`PrefetchingBackend`.
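
The per-window filtering idea can be sketched in a few lines. The helper name `files_overlapping_window`, the `Dict` layout, and the file names below are made up for illustration; the actual fix lives in the JRA55 `set!` method.

```julia
using Dates

# Hypothetical helper: keep only the yearly files whose year appears in the
# current window, instead of opening every file in the metadata.
function files_overlapping_window(files_by_year::Dict{Int,String}, window_dates)
    years = unique(year.(window_dates))
    return [files_by_year[y] for y in sort(years) if haskey(files_by_year, y)]
end

# Made-up file names for the 1958–2019 record.
files_by_year = Dict(y => "jra55_$(y).nc" for y in 1958:2019)

# A 3-hourly window that straddles New Year.
window = DateTime(1990, 12, 30):Hour(3):DateTime(1991, 1, 2)

files_overlapping_window(files_by_year, window)   # two files instead of 62
```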

How prefetching works
---------------------

A `PrefetchingBackend` carries an inner `DatasetBackend` plus three
mutable fields: a `Task`, a buffer `FieldTimeSeries` (a clone of the
main FTS whose `data` array is the prefetch destination), and the
absolute `next_start` index that the buffer will hold once its task
completes.

When Oceananigans calls `set!(fts)` after advancing the window:

1. If the pending prefetch's `next_start` matches the requested
   `start`, `wait(task)` (typically a no-op because the read finished
   while the time step was running) then
   `copyto!(parent(fts.data), parent(buffer_fts.data))` — a memory copy.
2. Otherwise — the cold path on first reload, on checkpointer restart,
   or after an unexpected window jump — drain any stale task,
   synchronously load via a one-off clone FTS, then copy.
3. Either way, schedule the next window's load:
   `Threads.@spawn set!(next_buffer_fts)` with the inner backend
   re-pointed at `mod1(start + length, length(fts.times))`.

The clone-FTS approach (rather than swapping `fts.backend` to the inner
DatasetBackend in place) keeps the type of `fts.backend` stable across
reloads and lets the spawned `set!` dispatch through the existing
JRA55-specific methods without any special-casing.
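
The hot/cold logic above can be illustrated with a toy model that uses plain vectors in place of a `FieldTimeSeries`. Everything here (`ToyPrefetcher`, `load_window!`, `reload!`) is a hypothetical stand-in, not the `PrefetchingBackend` code, and the "I/O" simply writes time indices into an array.

```julia
# Toy stand-in for the mechanism described above.
mutable struct ToyPrefetcher
    data::Vector{Float64}            # the in-memory window
    buffer::Vector{Float64}          # prefetch destination
    task::Union{Task,Nothing}        # pending prefetch, if any
    next_start::Int                  # window start the buffer will hold
    Nt::Int                          # total number of time indices
end

ToyPrefetcher(window_length, Nt) =
    ToyPrefetcher(zeros(window_length), zeros(window_length), nothing, 0, Nt)

# Pretend I/O: fill dest with indices start, start + 1, ... wrapped cyclically.
function load_window!(dest, start, Nt)
    for k in eachindex(dest)
        dest[k] = mod1(start + k - 1, Nt)
    end
    return dest
end

function reload!(p::ToyPrefetcher, start)
    if p.task !== nothing && p.next_start == start
        wait(p.task)                         # hot path: usually already done
        copyto!(p.data, p.buffer)
    else
        p.task !== nothing && wait(p.task)   # cold path: drain any stale task
        load_window!(p.data, start, p.Nt)    # synchronous load
    end
    # Either way, schedule the next window in the background.
    next_start = mod1(start + length(p.data), p.Nt)
    p.next_start = next_start
    p.task = Threads.@spawn load_window!(p.buffer, next_start, p.Nt)
    return p
end

p = ToyPrefetcher(4, 10)
reload!(p, 1)   # cold path
reload!(p, 5)   # hot path
reload!(p, 9)   # hot path, window wraps past the end of the series
```

Waiting on the previous task before spawning a new one means the buffer is never written while it is being copied, which is the property the two-path design relies on.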

JRA55 calendar caveat (multi-year)
----------------------------------

JRA55 NetCDF files use a `DateTimeNoLeap` (365-day) calendar internally,
while `all_dates(::MultiYearJRA55, name)` is a `Dates.DateTime` step
range that includes Feb 29 of leap years. `retrieve_data` is therefore
split: `retrieve_data(::RepeatYearJRA55Metadatum)` uses position-based
indexing (safe — repeat year 1990 is itself non-leap), while
`retrieve_data(::MultiYearJRA55Metadatum)` reads the file's time axis
and matches by `(Y, M, D, H, min)` components, sidestepping the
calendar mismatch entirely. A pre-existing analogous bug in
`set!(::JRA55NetCDFFTSMultipleYears)` (`file_times` and `fts.times`
diverge across leap years) is **not** addressed here — flag for a
follow-up issue.
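
As an illustration of the component matching described above, the sketch below stores a made-up no-leap time axis as `(Y, M, D, H, min)` tuples and looks up a `Dates.DateTime` by its components. The names `components` and `file_index` and the axis values are assumptions for the example, not the code in `retrieve_data`.

```julia
using Dates

# (Y, M, D, H, min) components of a DateTime.
components(t) = (year(t), month(t), day(t), hour(t), minute(t))

# Made-up no-leap file axis around the end of February 1992: no Feb 29 entries.
file_axis = [(1992, 2, 28, 21, 0), (1992, 3, 1, 0, 0), (1992, 3, 1, 3, 0)]

# Index of the record matching `date`, or `nothing` if the file has no such record.
file_index(date, axis) = findfirst(==(components(date)), axis)

file_index(DateTime(1992, 3, 1, 0), file_axis)    # 2
file_index(DateTime(1992, 2, 29, 0), file_axis)   # nothing
```

Matching by components never converts between calendars, so a date the file lacks is reported as missing instead of silently shifting every later record by one day.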

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR unifies JRA55 with the generic Metadata / FieldTimeSeries + DatasetBackend pipeline used by other datasets, and introduces an async PrefetchingBackend to overlap sliding-window I/O with compute.

Changes:

  • Extract DatasetBackend into its own module and route FieldTimeSeries(::Metadata) through it (including a prefetch option).
  • Add PrefetchingBackend that preloads the next time window using Threads.@spawn.
  • Refactor JRA55 to use the generic APIs (remove JRA55FieldTimeSeries export; update tests/examples accordingly).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/DataWrangling/dataset_backend.jl | New standalone DatasetBackend definition + generic per-file set!. |
| src/DataWrangling/prefetching_backend.jl | New async backend wrapper that prefetches the next sliding window. |
| src/DataWrangling/metadata_field_time_series.jl | Routes FieldTimeSeries(::Metadata) through DatasetBackend; adds prefetch kwarg. |
| src/DataWrangling/JRA55/JRA55_field_time_series.jl | Removes bespoke backend struct; adds JRA55 retrieve_data and JRA55-specific set! on DatasetBackend. |
| src/DataWrangling/JRA55/JRA55_metadata.jl | Makes JRA55 metadata compatible with generic native-grid path; sets default_inpainting(::JRA55Metadata)=nothing. |
| src/DataWrangling/JRA55/JRA55_prescribed_atmosphere.jl | Rewrites atmosphere construction on top of generic FieldTimeSeries(::Metadata) and adds prefetch kwarg. |
| src/DataWrangling/restoring.jl | Threads prefetch=false through DatasetRestoring. |
| src/DataWrangling/DataWrangling.jl | Includes new backend modules. |
| src/DataWrangling/JRA55/JRA55.jl | Updates exports (drops JRA55FieldTimeSeries, adds JRA55NetCDFBackend). |
| src/NumericalEarth.jl | Removes JRA55FieldTimeSeries from exports. |
| test/test_jra55.jl | Updates tests to use generic FieldTimeSeries(Metadata(...)) and adds prefetching regression coverage. |
| test/test_downloading.jl | Updates JRA55 download test to use FieldTimeSeries(Metadata(...)). |
| examples/inspect_JRA55_data.jl | Updates example to the new JRA55 access pattern. |
Comments suppressed due to low confidence (1)

src/DataWrangling/prefetching_backend.jl:126

  • set!(::PrefetchingFTS) allocates a brand-new FieldTimeSeries buffer on every reload (next_fts = buffer_field_time_series(...)). For large windows this implies repeated large allocations (and GC pressure) each time the in-memory window advances, which can easily dominate runtime and memory usage.

A more scalable approach is to allocate the buffer FTS once (or use a small fixed ring of 1–2 buffers), then reuse it by updating its backend to the next window start before spawning the task, rather than constructing a fresh FTS each time.



Comment thread examples/inspect_JRA55_data.jl
Comment thread src/DataWrangling/metadata_field_time_series.jl Outdated
Comment thread src/DataWrangling/JRA55/JRA55_field_time_series.jl
Comment thread src/DataWrangling/JRA55/JRA55_prescribed_atmosphere.jl
Comment thread src/DataWrangling/prefetching_backend.jl Outdated
Comment thread src/DataWrangling/prefetching_backend.jl Outdated
@glwagner
Member

Heh, nice idea!

Just to explore designs, would an alternative implementation be to add a prefetching property to FieldTimeSeries?

This would avoid the wrapper on backend and could allow us to make this a default without materialization perhaps.

@simone-silvestri
Member Author

> Just to explore designs, would an alternative implementation be to add a prefetching property to FieldTimeSeries?

Nice idea. We could implement this in a PR in Oceananigans: FieldTimeSeries would need to carry the prefetch state, which would be hooked into update_field_time_series!. Shall we land this here so we can test it, and I open a PR in Oceananigans afterwards? We probably need some guards, since this is a feature that users should not tamper with in postprocessing and when analyzing outputs.

@simone-silvestri
Member Author

As part of this PR I will temporarily disable single_column_os_papa_simulation.jl, since it builds a JRA55PrescribedAtmosphere on a single column given latitude and longitude. That feature is discontinued in this PR; the way to do it is to prescribe a region to the metadata after #142.

Comment thread src/DataWrangling/ECCO/ECCO_atmosphere.jl Outdated
Comment thread src/DataWrangling/JRA55/JRA55_prescribed_atmosphere.jl Outdated
Member

@glwagner glwagner left a comment


A few complaints...

  • Constructors like blah_blah_fts don't read well, in my opinion. We could use blah_blah_field_time_series, but much better would be BlahBlahFieldTimeSeries, which I think is much more obvious.

  • time_indices_in_memory is a pretty awkward kwarg, but maybe we are stuck with it.

@simone-silvestri simone-silvestri changed the title Unify JRA55 onto generic DatasetBackend Unify JRA55 onto the generic dataset backend, extend region support across all paths Apr 28, 2026
@simone-silvestri simone-silvestri changed the title Unify JRA55 onto the generic dataset backend, extend region support across all paths (0.5.0) Unify JRA55 onto the generic dataset backend, extend region support across all paths Apr 29, 2026
@simone-silvestri simone-silvestri changed the title (0.5.0) Unify JRA55 onto the generic dataset backend, extend region support across all paths (0.3.1) Unify JRA55 onto the generic dataset backend, extend region support across all paths Apr 29, 2026
@simone-silvestri
Member Author

@glwagner this is ready to merge


Development

Successfully merging this pull request may close these issues.

Unify JRA55 with all the other datasets

3 participants